Tag
4 articles
GPT-5.5 tops AI benchmarks but still hallucinates frequently, and its API cost has risen by 20%.
Alibaba's Qwen3.6 outperforms Google's Gemma 4 in agentic coding benchmarks, showcasing the power of efficient AI architectures.
This article explains how human disagreement in AI benchmarking can lead to unreliable performance metrics and why current practices need to evolve to account for annotation variability.
Cohere has released an open-source speech recognition model that outperforms industry leader OpenAI's Whisper in benchmark tests.